Depression is a mood disorder that affects more than 264 million people globally, making it one of the leading causes of disability worldwide. It's characterized by symptoms such as profound sadness, feelings of emptiness, anxiety, sleep disturbance, and a general loss of initiative and interest in activities. The severity of depression is determined by the number of symptoms, their seriousness and duration, and their consequences for social and occupational function.
One way to classify depression is as unipolar or bipolar: unipolar depression refers to major depressive disorder, while bipolar depression is a facet of bipolar disorder. Both are genetic mood disorders and share symptoms, but a distinction should be made between the two: bipolar depression is unique in the periodic occurrence of mania, a state associated with inflated self-esteem, impulsivity, increased activity, goal-directed actions, and reduced sleep.
Although there are known, effective treatments for mental disorders, between 76% and 85% of people in low- and middle-income countries receive no treatment for their disorder. One barrier to effective care is inaccurate assessment: in countries of all income levels, people who are depressed are often not correctly diagnosed.
Actigraphs are small motion sensors (accelerometers) encased in a unit about the size of a wristwatch, which can be worn continuously for days to months. It is well established that depression is characterized by altered motor activity, and actigraph recordings of motor activity are considered an objective method for observing depression. Although the relationship has not yet been exhaustively studied, there is increasing awareness in psychiatry of how activity data relates to various mental health issues such as changes in mood or personality, inability to cope with daily problems, stress, and withdrawal from friends and activities.
In the following tutorial, we will walk through the Data Science Pipeline to see if depression states can be accurately predicted from the sensor data recorded by actigraphs.
We'll be looking at actigraph data originally collected for a study on motor activity in schizophrenia and major depression. Actigraphs continuously record an activity count proportional to the intensity of movement in one-minute intervals. The dataset consists of actigraphy data collected for the condition group (23 unipolar and bipolar depressed patients) as well as the control group (32 non-depressed participants). We'll be using the Montgomery-Asberg Depression Rating Scale (MADRS) score included in the data for each participant to identify the severity of an ongoing depression. The score is based on ten items relevant to depression, which clinicians rate based on observation of and conversation with the patient. The sum score (0-60) represents the severity: scores below 10 are classified as an absence of depressive symptoms, and scores above 30 indicate a severe depressive state.
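To make these cutoffs concrete, here's a small helper that maps a MADRS sum score to a coarse severity label. A minimal sketch: the cutoffs follow the ranges described above, but the intermediate label name is our own shorthand, not a clinical category.

```python
def madrs_severity(score):
    """Map a MADRS sum score (0-60) to a coarse severity label.

    Cutoffs follow the ranges described above; 'mild-to-moderate'
    is our own shorthand for the in-between range, not a clinical term.
    """
    if not 0 <= score <= 60:
        raise ValueError('MADRS sum scores range from 0 to 60')
    if score < 10:
        return 'no depressive symptoms'
    if score <= 30:
        return 'mild-to-moderate'
    return 'severe'

print(madrs_severity(19))  # prints 'mild-to-moderate'
```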
First let's import the libraries we'll use throughout the tutorial.
# for data
import os
import glob
import pandas as pd
import datetime as dt
# for plotting
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# other
import warnings
from IPython.display import Image
The dataset contains a scores.csv file of demographic and clinical data for every participant, along with one csv file of minute-by-minute actigraph recordings per participant. Let's start by reading in the scores:
# skipinitialspace skips spaces after the delimiter ',' (allows blank cells to be filled with NaN)
scores = pd.read_csv('data/scores.csv', skipinitialspace=True)
display(scores)
| | number | days | gender | age | afftype | melanch | inpatient | edu | marriage | work | madrs1 | madrs2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | condition_1 | 11 | 2 | 35-39 | 2 | 2 | 2 | 6-10 | 1 | 2 | 19 | 19 |
| 1 | condition_2 | 18 | 2 | 40-44 | 1 | 2 | 2 | 6-10 | 2 | 2 | 24 | 11 |
| 2 | condition_3 | 13 | 1 | 45-49 | 2 | 2 | 2 | 6-10 | 2 | 2 | 24 | 25 |
| 3 | condition_4 | 13 | 2 | 25-29 | 2 | 2 | 2 | 11-15 | 1 | 1 | 20 | 16 |
| 4 | condition_5 | 13 | 2 | 50-54 | 2 | 2 | 2 | 11-15 | 2 | 2 | 26 | 26 |
| 5 | condition_6 | 7 | 1 | 35-39 | 2 | 2 | 2 | 6-10 | 1 | 2 | 18 | 15 |
| 6 | condition_7 | 11 | 1 | 20-24 | 1 | NaN | 2 | 11-15 | 2 | 1 | 24 | 25 |
| 7 | condition_8 | 5 | 2 | 25-29 | 2 | NaN | 2 | 11-15 | 1 | 2 | 20 | 16 |
| 8 | condition_9 | 13 | 2 | 45-49 | 1 | NaN | 2 | 6-10 | 1 | 2 | 26 | 26 |
| 9 | condition_10 | 9 | 2 | 45-49 | 2 | 2 | 2 | 6-10 | 1 | 2 | 28 | 21 |
| 10 | condition_11 | 14 | 1 | 45-49 | 2 | 2 | 2 | 6-10 | 1 | 2 | 24 | 24 |
| 11 | condition_12 | 12 | 2 | 40-44 | 1 | 2 | 2 | 6-10 | 2 | 2 | 25 | 21 |
| 12 | condition_13 | 14 | 2 | 35-39 | 1 | 2 | 2 | 11-15 | 2 | 2 | 18 | 13 |
| 13 | condition_14 | 14 | 1 | 60-64 | 1 | 2 | 2 | 6-10 | 2 | 2 | 28 | 19 |
| 14 | condition_15 | 13 | 2 | 55-59 | 2 | 2 | 2 | 11-15 | 1 | 1 | 14 | 18 |
| 15 | condition_16 | 16 | 1 | 45-49 | 2 | 2 | 2 | 11-15 | 1 | 2 | 13 | 17 |
| 16 | condition_17 | 13 | 1 | 50-54 | 1 | 2 | 2 | 6-10 | 1 | 2 | 17 | 15 |
| 17 | condition_18 | 13 | 2 | 40-44 | 3 | 2 | 2 | 11-15 | 2 | 2 | 18 | 15 |
| 18 | condition_19 | 13 | 2 | 50-54 | 2 | 2 | 1 | 16-20 | 2 | 2 | 26 | 21 |
| 19 | condition_20 | 13 | 1 | 30-34 | 2 | 1 | 1 | 6-10 | 1 | 2 | 27 | 25 |
| 20 | condition_21 | 13 | 2 | 35-39 | 2 | 2 | 1 | 6-10 | 2 | 2 | 26 | 21 |
| 21 | condition_22 | 14 | 1 | 65-69 | 2 | 2 | 1 | NaN | 2 | 2 | 29 | 28 |
| 22 | condition_23 | 16 | 1 | 30-34 | 2 | 2 | 1 | 16-20 | 2 | 2 | 29 | 23 |
| 23 | control_1 | 8 | 2 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | control_2 | 20 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | control_3 | 12 | 2 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26 | control_4 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 27 | control_5 | 13 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 28 | control_6 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 29 | control_7 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | control_8 | 13 | 2 | 40-44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 31 | control_9 | 13 | 2 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | control_10 | 8 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 33 | control_11 | 13 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 34 | control_12 | 14 | 1 | 60-64 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 35 | control_13 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 36 | control_14 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 37 | control_15 | 11 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 38 | control_16 | 13 | 2 | 40-44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39 | control_17 | 9 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 40 | control_18 | 13 | 2 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 41 | control_19 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 42 | control_20 | 13 | 1 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 43 | control_21 | 8 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 44 | control_22 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 45 | control_23 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46 | control_24 | 13 | 2 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 47 | control_25 | 13 | 1 | 65-69 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 48 | control_26 | 13 | 1 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 49 | control_27 | 13 | 2 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50 | control_28 | 16 | 2 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 51 | control_29 | 13 | 2 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 52 | control_30 | 9 | 2 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 53 | control_31 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 54 | control_32 | 14 | 2 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Right away we can see there is a lot of missing data, mostly the clinical and depression data for the control group. We expect this to be missing, because depression data is only collected for those in the condition group; thus it's considered missing at random (MAR). The control group is also missing data for education, marriage, and work. This data could also be considered MAR: it looks as though only number, days, gender, and age were collected for the control group, so the data is missing because the participants are part of the control group. Since we'll be focusing mainly on the actigraph data for this group, it shouldn't be a concern for the rest of the tutorial. For the condition group, there are three missing melancholia scores and one missing education range value. It's not immediately clear which kind of missing data these are; after the next step we'll take a closer look to see if there's any correlation within the condition group's missing data.
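As a quick sanity check on this missingness pattern, we can count NaNs per column within each group. A minimal sketch on a toy frame that mirrors the scores layout (the values here are illustrative, not from the dataset):

```python
import pandas as pd

# toy frame mirroring the scores layout (illustrative values only)
sample = pd.DataFrame({
    'number': ['condition_1', 'condition_2', 'control_1', 'control_2'],
    'melanch': [2.0, None, None, None],
    'edu': ['6-10', None, None, None],
})

# derive the group label from the participant id, then count NaNs per column
sample['group'] = sample['number'].str.split('_').str[0]
missing_by_group = sample.drop(columns='number').groupby('group').agg(
    lambda s: s.isna().sum())
print(missing_by_group)
```

On the real scores DataFrame the same idiom would show the control group missing every clinical column and the condition group missing only a handful of values.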
Notes
skipinitialspace=True removes whitespace preceding non-missing data; since we don't need to preserve whitespace in this data, it doesn't cause any issues. Also, columns containing missing values are stored as floats (pandas represents missing entries as NaN, which has no integer equivalent). However, we can still display the values as integers without modifying the actual data using pandas display options.

Since we'll be looking at differences between the condition and control groups, let's split the scores data into a control group DataFrame and a condition group DataFrame. Even though we could use the numeric indices to select a subset of the DataFrame, in the event there is a greater number of rows or the partition of indices isn't clear, it may be useful to select rows based on the column values (e.g. control or condition) as follows.
# select rows whose number column starts with either control or condition, and make DataFrame
control_scores = scores.loc[scores['number'].str.startswith('control')].copy()
condition_scores = scores.loc[scores['number'].str.startswith('condition')].copy()
# make index correspond to number column
control_scores.reset_index(drop=True, inplace=True)
control_scores.index += 1
condition_scores.index += 1
# display control group data
print('\nControl Group:')
display(control_scores)
# format floats to display as int, add title, and display condition group data
print('\n\nCondition Group:')
pd.options.display.float_format = '{:,.0f}'.format
display(condition_scores)
Control Group:
| | number | days | gender | age | afftype | melanch | inpatient | edu | marriage | work | madrs1 | madrs2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | control_1 | 8 | 2 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | control_2 | 20 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | control_3 | 12 | 2 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | control_4 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | control_5 | 13 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | control_6 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | control_7 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | control_8 | 13 | 2 | 40-44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | control_9 | 13 | 2 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | control_10 | 8 | 1 | 30-34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | control_11 | 13 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | control_12 | 14 | 1 | 60-64 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | control_13 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 14 | control_14 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | control_15 | 11 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16 | control_16 | 13 | 2 | 40-44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | control_17 | 9 | 1 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | control_18 | 13 | 2 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19 | control_19 | 13 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 20 | control_20 | 13 | 1 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 21 | control_21 | 8 | 1 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 22 | control_22 | 13 | 1 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 23 | control_23 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | control_24 | 13 | 2 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | control_25 | 13 | 1 | 65-69 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26 | control_26 | 13 | 1 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 27 | control_27 | 13 | 2 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 28 | control_28 | 16 | 2 | 45-49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 29 | control_29 | 13 | 2 | 50-54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | control_30 | 9 | 2 | 35-39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 31 | control_31 | 13 | 1 | 20-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | control_32 | 14 | 2 | 25-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Condition Group:
| | number | days | gender | age | afftype | melanch | inpatient | edu | marriage | work | madrs1 | madrs2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | condition_1 | 11 | 2 | 35-39 | 2 | 2 | 2 | 6-10 | 1 | 2 | 19 | 19 |
| 2 | condition_2 | 18 | 2 | 40-44 | 1 | 2 | 2 | 6-10 | 2 | 2 | 24 | 11 |
| 3 | condition_3 | 13 | 1 | 45-49 | 2 | 2 | 2 | 6-10 | 2 | 2 | 24 | 25 |
| 4 | condition_4 | 13 | 2 | 25-29 | 2 | 2 | 2 | 11-15 | 1 | 1 | 20 | 16 |
| 5 | condition_5 | 13 | 2 | 50-54 | 2 | 2 | 2 | 11-15 | 2 | 2 | 26 | 26 |
| 6 | condition_6 | 7 | 1 | 35-39 | 2 | 2 | 2 | 6-10 | 1 | 2 | 18 | 15 |
| 7 | condition_7 | 11 | 1 | 20-24 | 1 | NaN | 2 | 11-15 | 2 | 1 | 24 | 25 |
| 8 | condition_8 | 5 | 2 | 25-29 | 2 | NaN | 2 | 11-15 | 1 | 2 | 20 | 16 |
| 9 | condition_9 | 13 | 2 | 45-49 | 1 | NaN | 2 | 6-10 | 1 | 2 | 26 | 26 |
| 10 | condition_10 | 9 | 2 | 45-49 | 2 | 2 | 2 | 6-10 | 1 | 2 | 28 | 21 |
| 11 | condition_11 | 14 | 1 | 45-49 | 2 | 2 | 2 | 6-10 | 1 | 2 | 24 | 24 |
| 12 | condition_12 | 12 | 2 | 40-44 | 1 | 2 | 2 | 6-10 | 2 | 2 | 25 | 21 |
| 13 | condition_13 | 14 | 2 | 35-39 | 1 | 2 | 2 | 11-15 | 2 | 2 | 18 | 13 |
| 14 | condition_14 | 14 | 1 | 60-64 | 1 | 2 | 2 | 6-10 | 2 | 2 | 28 | 19 |
| 15 | condition_15 | 13 | 2 | 55-59 | 2 | 2 | 2 | 11-15 | 1 | 1 | 14 | 18 |
| 16 | condition_16 | 16 | 1 | 45-49 | 2 | 2 | 2 | 11-15 | 1 | 2 | 13 | 17 |
| 17 | condition_17 | 13 | 1 | 50-54 | 1 | 2 | 2 | 6-10 | 1 | 2 | 17 | 15 |
| 18 | condition_18 | 13 | 2 | 40-44 | 3 | 2 | 2 | 11-15 | 2 | 2 | 18 | 15 |
| 19 | condition_19 | 13 | 2 | 50-54 | 2 | 2 | 1 | 16-20 | 2 | 2 | 26 | 21 |
| 20 | condition_20 | 13 | 1 | 30-34 | 2 | 1 | 1 | 6-10 | 1 | 2 | 27 | 25 |
| 21 | condition_21 | 13 | 2 | 35-39 | 2 | 2 | 1 | 6-10 | 2 | 2 | 26 | 21 |
| 22 | condition_22 | 14 | 1 | 65-69 | 2 | 2 | 1 | NaN | 2 | 2 | 29 | 28 |
| 23 | condition_23 | 16 | 1 | 30-34 | 2 | 2 | 1 | 16-20 | 2 | 2 | 29 | 23 |
We could drop the all-NaN columns from the control group DataFrame, but we'll keep them for now so its structure corresponds to the condition group DataFrame.
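If we did want to drop the columns that are entirely NaN, dropna with how='all' handles it. A minimal sketch on a toy frame (illustrative values only):

```python
import pandas as pd
import numpy as np

# toy control-group rows: afftype is entirely NaN (illustrative values only)
toy = pd.DataFrame({'number': ['control_1', 'control_2'],
                    'days': [8, 20],
                    'afftype': [np.nan, np.nan]})

# drop columns whose every value is NaN
trimmed = toy.dropna(axis=1, how='all')
print(list(trimmed.columns))  # ['number', 'days']
```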
As mentioned previously, the condition group has missing data. It doesn't look as though it depends on any data in other columns, so it may be missing completely at random (MCAR). We can use different plots/figures to display any correlation between missing values. With only two columns containing missing data, the following missingno heatmap is not too informative, but it could be useful for larger datasets with more missing data.
msno.heatmap(condition_scores, figsize=(6,5))
<AxesSubplot:>
Since the goal in identifying the type of missing data is to determine if and how it might affect future analyses, let's use seaborn to plot the scores we'll be using in the analysis (MADRS 1 and 2) on the x- and y-axes, then see if the missing melancholia values correlate in any way. We can use color and size to differentiate the NA values, and marker style to distinguish the education range values.
cond_NA = condition_scores.fillna('NA') # dataframe for purpose of plotting, with filled NA values
sns.relplot(x=cond_NA['madrs1'], y=cond_NA['madrs2'], style=cond_NA['edu'], size=cond_NA['melanch'], size_order=['NA', 2.0, 1.0], hue=cond_NA['melanch'], hue_order=[1.0, 2.0, 'NA'], palette='icefire', alpha=0.8, height=6).set(title='Melancholia with respect to MADRS scores')
<seaborn.axisgrid.FacetGrid at 0x7fcabc801d60>
Other than 2 of the 3 NA points overlapping other points, the figure doesn't show any particular correlation, e.g. that all the missing melancholia scores have the same MADRS 1 or 2 scores, or that all three have the same education range value. As a result, we can move forward with our analysis and treat the missing condition group data as MCAR.
Now that we've organized the scores data, we need to get all the actigraph data. After getting the control group data, we will repeat the process for the condition group. Originally this step involved constructing two DataFrames, one for the control group and one for the condition group, but since the dates and amounts of measured actigraph data don't necessarily align among the participants, there is no benefit to storing all the actigraph data together in one DataFrame. Instead, we'll make a control list and a condition list, and fill them with one DataFrame per csv file (i.e. per participant).
# get list of csv filenames in control folder (data/control is relative path)
control_filenames = glob.glob(os.path.join('data/control', '*.csv'))
# use list comprehension to create list of DataFrames from csv files
# select timestamp and activity columns (date is repetitive) and parse dates so they are converted to DateTime objects
control_groups = [pd.read_csv(filename, parse_dates=['timestamp'])[['timestamp', 'activity']] for filename in control_filenames]
# print first 3 DataFrames in list to show how data is stored
control_groups[0:3]
[ timestamp activity
0 2003-11-11 09:00:00 9
1 2003-11-11 09:01:00 7
2 2003-11-11 09:02:00 7
3 2003-11-11 09:03:00 7
4 2003-11-11 09:04:00 7
... ... ...
29033 2003-12-01 12:53:00 7
29034 2003-12-01 12:54:00 7
29035 2003-12-01 12:55:00 5
29036 2003-12-01 12:56:00 5
29037 2003-12-01 12:57:00 7
[29038 rows x 2 columns],
timestamp activity
0 2004-02-24 09:00:00 0
1 2004-02-24 09:01:00 0
2 2004-02-24 09:02:00 0
3 2004-02-24 09:03:00 0
4 2004-02-24 09:04:00 0
... ... ...
21819 2004-03-10 12:39:00 0
21820 2004-03-10 12:40:00 0
21821 2004-03-10 12:41:00 0
21822 2004-03-10 12:42:00 0
21823 2004-03-10 12:43:00 0
[21824 rows x 2 columns],
timestamp activity
0 2003-03-18 15:00:00 19
1 2003-03-18 15:01:00 97
2 2003-03-18 15:02:00 586
3 2003-03-18 15:03:00 1183
4 2003-03-18 15:04:00 266
... ... ...
51584 2003-04-23 11:44:00 5
51585 2003-04-23 11:45:00 5
51586 2003-04-23 11:46:00 5
51587 2003-04-23 11:47:00 5
51588 2003-04-23 11:48:00 360
[51589 rows x 2 columns]]
# get list of csv filenames in condition folder (data/condition is relative path)
condition_filenames = glob.glob(os.path.join('data/condition', '*.csv'))
# use list comprehension to create list of DataFrames from csv files
# select timestamp and activity columns (date is repetitive) and parse dates so they are converted to DateTime objects
condition_groups = [pd.read_csv(filename, parse_dates=['timestamp'])[['timestamp', 'activity']] for filename in condition_filenames]
# print first 3 DataFrames in list to show how data is stored
condition_groups[0:3]
[ timestamp activity
0 2003-05-07 15:00:00 1468
1 2003-05-07 15:01:00 1006
2 2003-05-07 15:02:00 468
3 2003-05-07 15:03:00 306
4 2003-05-07 15:04:00 143
... ... ...
38921 2003-06-03 15:41:00 9
38922 2003-06-03 15:42:00 35
38923 2003-06-03 15:43:00 13
38924 2003-06-03 15:44:00 0
38925 2003-06-03 15:45:00 0
[38926 rows x 2 columns],
timestamp activity
0 2005-08-11 09:00:00 0
1 2005-08-11 09:01:00 0
2 2005-08-11 09:02:00 0
3 2005-08-11 09:03:00 0
4 2005-08-11 09:04:00 0
... ... ...
25905 2005-08-29 08:45:00 0
25906 2005-08-29 08:46:00 0
25907 2005-08-29 08:47:00 0
25908 2005-08-29 08:48:00 0
25909 2005-08-29 08:49:00 0
[25910 rows x 2 columns],
timestamp activity
0 2003-05-07 12:00:00 0
1 2003-05-07 12:01:00 143
2 2003-05-07 12:02:00 0
3 2003-05-07 12:03:00 20
4 2003-05-07 12:04:00 166
... ... ...
23239 2003-05-23 15:19:00 0
23240 2003-05-23 15:20:00 0
23241 2003-05-23 15:21:00 0
23242 2003-05-23 15:22:00 0
23243 2003-05-23 15:23:00 533
[23244 rows x 2 columns]]
As the comments in the code describe, we parse the timestamps while reading the csv so they'll be formatted as DateTime objects, which may come in handy later on in the tutorial. Additionally, the first three elements of each list are displayed above, showing the partial DataFrames of Actigraph data and their row count (number of Actigraph recordings). Let's use matplotlib to get an idea of the data we're working with.
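Before the plots, here's a quick illustration of what those parsed DateTime timestamps buy us: with a DatetimeIndex, pandas can resample the one-minute counts to coarser intervals. A sketch on synthetic data (not real recordings):

```python
import pandas as pd

# two hours of synthetic one-minute activity counts (not real recordings)
idx = pd.date_range('2003-11-11 09:00', periods=120, freq='min')
df = pd.DataFrame({'timestamp': idx, 'activity': range(120)})

# resample to hourly mean activity using the parsed timestamps
hourly = df.set_index('timestamp')['activity'].resample('60min').mean()
print(hourly)
```

We'll stick with the raw one-minute counts for the scatter plots below, but this kind of aggregation may come in handy later.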
warnings.filterwarnings('ignore')
plt.figure(figsize=(20, 16))
grp = 1
for row in range(1):
    for col in range(1, 3):
        ax = plt.subplot2grid((6, 4), (row, col))
        ax.scatter(control_groups[grp]['timestamp'], control_groups[grp]['activity'], alpha=0.1, s=0.1)
        ax.tick_params(axis='x', rotation=45)
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
        ax.set_title('Control ' + str((row * 4) + (col + 1)))
        ax.set_xlabel('Date (year and month)')
        ax.set_ylabel('Activity (intensity)')
        grp += 1
plt.savefig('control_scatter.png', bbox_inches='tight')
#Image(filename='control_scatter.png')
warnings.filterwarnings('ignore')
plt.figure(figsize=(20, 16))
grp = 1
for row in range(1):
    for col in range(1, 3):
        ax = plt.subplot2grid((6, 4), (row, col))
        ax.scatter(condition_groups[grp]['timestamp'], condition_groups[grp]['activity'], alpha=0.1, s=0.5)
        ax.tick_params(axis='x', rotation=45)
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
        ax.set_title('Condition ' + str((row * 4) + (col + 1)))
        ax.set_xlabel('Date (year and month)')
        ax.set_ylabel('Activity (intensity)')
        grp += 1
plt.savefig('condition_scatter.png', bbox_inches='tight')
#Image(filename='condition_scatter.png')
We can see from these 4 samples that the actigraph data spans one to two months, and there is a lot of variation in activity intensity over time. The subplots have their own y-axis tick values, so we need to look carefully at the range if we want to compare across the 4 samples. From this initial glance at the data, the upper range of the control groups' activity reaches about 2,000, whereas it's less consistent for the condition groups: Condition 3 has a majority of its activity below 500 with scattered points up to 1,000, and Condition 2 has a majority of its activity below 1,000 with spikes up to 3,000-3,500. Overall, we've learned there are certain days or weeks during the recorded months where activity spikes, and the intensity value usually falls somewhere in the 0-5,000 range.
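One simple way to quantify this spikiness numerically is the share of minutes above some intensity threshold. A sketch on synthetic counts (the threshold here is an arbitrary illustration, not a value from the study):

```python
import pandas as pd

# synthetic one-minute activity counts (not real recordings)
activity = pd.Series([0, 5, 7, 1200, 3, 0, 2500, 9, 0, 4])

threshold = 1000  # arbitrary cutoff for a "high-intensity" minute
share_high = (activity > threshold).mean()
print(f'{share_high:.0%} of minutes exceed {threshold}')  # 20% of minutes exceed 1000
```

Applied per participant, a statistic like this could later feed into comparisons between the control and condition groups.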
For the first step in the exploratory data analysis, let's see if we can better understand the range and other statistics about each group using boxplots.
control_filenames = glob.glob(os.path.join('data/control', '*.csv'))
group = 1
control_activity = pd.DataFrame()
for filename in control_filenames:
    file_df = pd.read_csv(filename, parse_dates=['timestamp'])[['timestamp', 'activity']]
    control_activity[str(group)] = file_df['activity']
    group += 1
display(control_activity[0:5])
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 0 | 19 | 0 | 0 | 1,221 | 5 | 3 | 219 | 3 | ... | 82 | 30 | 3 | 7 | 6 | 0 | 60 | 3 | 321 | 0 |
| 1 | 7 | 0 | 97 | 0 | 0 | 667 | 3 | 3 | 712 | 2 | ... | 0 | 0 | 0 | 5 | 4 | 3 | 0 | 0 | 115 | 0 |
| 2 | 7 | 0 | 586 | 0 | 0 | 469 | 3 | 3 | 1,076 | 0 | ... | 0 | 151 | 0 | 5 | 4 | 3 | 264 | 7 | 240 | 0 |
| 3 | 7 | 0 | 1183 | 0 | 0 | 349 | 3 | 3 | 948 | 0 | ... | 0 | 156 | 0 | 5 | 4 | 0 | 662 | 0 | 888 | 0 |
| 4 | 7 | 0 | 266 | 0 | 0 | 178 | 3 | 3 | 1,042 | 0 | ... | 0 | 1509 | 0 | 5 | 4 | 0 | 293 | 0 | 1378 | 0 |
5 rows × 32 columns
condition_filenames = glob.glob(os.path.join('data/condition', '*.csv'))
group = 1
condition_activity = pd.DataFrame()
for filename in condition_filenames:
    file_df = pd.read_csv(filename, parse_dates=['timestamp'])[['timestamp', 'activity']]
    condition_activity[str(group)] = file_df['activity']
    group += 1
display(condition_activity[0:5])
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1468 | 0 | 0 | 0 | 0 | 249 | 0 | 111 | 91 | 97 | ... | 0 | 5 | 0 | 0 | 5 | 0 | 3 | 0 | 510 | 0 |
| 1 | 1006 | 0 | 143 | 0 | 0 | 69 | 0 | 66 | 0 | 498 | ... | 0 | 3 | 0 | 3 | 5 | 53 | 190 | 0 | 637 | 349 |
| 2 | 468 | 0 | 0 | 0 | 0 | 116 | 0 | 157 | 0 | 249 | ... | 0 | 3 | 0 | 0 | 5 | 0 | 8 | 0 | 598 | 111 |
| 3 | 306 | 0 | 20 | 0 | 0 | 258 | 0 | 73 | 0 | 396 | ... | 0 | 3 | 3 | 0 | 5 | 0 | 3 | 0 | 251 | 38 |
| 4 | 143 | 0 | 166 | 0 | 0 | 152 | 0 | 142 | 0 | 209 | ... | 0 | 283 | 0 | 0 | 5 | 0 | 14 | 0 | 93 | 3 |
5 rows × 23 columns
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
control_activity_plot = control_activity.boxplot(
    grid=False, boxprops=dict(color='r'),
    flierprops=dict(marker='.', markeredgecolor='darkred', markersize=1, alpha=0.5),
    whiskerprops=dict(color='r'), medianprops=dict(color='purple'), ax=axes[0])
control_activity_plot.set_title('Control Group Activity')
control_activity_plot.set_xlabel('Group')
control_activity_plot.set_ylabel('Activity (intensity)')
condition_activity_plot = condition_activity.boxplot(
    grid=False,
    flierprops=dict(marker='.', markeredgecolor='b', markersize=2, alpha=0.7), ax=axes[1])
condition_activity_plot.set_title('Condition Group Activity')
condition_activity_plot.set_xlabel('Group')
condition_activity_plot.set_ylabel('Activity (intensity)')
fig.tight_layout()